Suppose that two large, high-dimensional data sets are each noisy measurements of the
same underlying random process, and principal components analysis is performed separately
on the data sets to reduce their dimensionality. In some circumstances it may happen that
the two lower-dimensional data sets have an inordinately large Procrustean fitting-error
between them. The purpose of this manuscript is to quantify this "incommensurability
phenomenon." In particular, under specified conditions, the square Procrustean fitting-error of
the two normalized lower-dimensional data sets is (asymptotically) a convex combination (via
a correlation parameter) of the square Hausdorff distance between the projection subspaces and the
maximum possible value of the square Procrustean fitting-error for normalized data. We show
how this gives rise to the incommensurability phenomenon, and we employ illustrative
simulations to explore how the incommensurability phenomenon may have an appreciable impact.

Keywords: Procrustes fitting, high-dimensional data, principal components analysis,
Grassmannian, Hausdorff distance, incommensurability phenomenon.



1 Overview and Outline 

The ever-increasing importance of modern big-data analytics brings with it the imperative to 
understand fusion and inference on multiple and massive disparate, distributed data sets. What 
processing can be profitably done separately, for subsequent joint inference? In the case where 
each data set consists of measurements on the same objects, and combining the full data sets 
is prohibitively expensive, it seems reasonable to separately project each large, high-dimensional 
collection to a low-dimensional space, and to then combine the representations. Unfortunately,
this model can lead to undesirable incommensurability with significant deleterious effects on fusion
and inference. In this manuscript, we quantify an appearance of this phenomenon.

In Section 2 we begin with an idealized Tale of Two Scientists and its accompanying Theorem 1
in order to pave the way for our main result, Theorem 2 (stated and proved in Section 3), wherein,
under more general conditions, an asymptotic relationship is given between the Procrustean
fitting-error of the two lower-dimensional data sets and a distance between the projections.

Then, in Section 4, we perform simulation experiments to illustrate and support our main result,
Theorem 2, and we use these simulations to explore the implications of Theorem 2; in particular,
when there is an insufficient spectral gap in the covariance structure at the projection-dimension
cutoff, a large projection distance may result between the projections for the two data sets, and
an inordinately large Procrustean fitting-error then follows. This "incommensurability phenomenon"
was named in Priebe et al. [1].

2 A cautionary Tale of Two Scientists 

For this section only, we explore an idealized scenario for the purpose of illustration; the general
setting will be treated in Section 3. For this entire manuscript, a general background reference
for the matrix analysis tools that we employ (e.g., Procrustes fitting, singular value decomposition,
spectral and norm identities and inequalities such as Weyl's Theorem for Hermitian matrices
and Interlacing inequalities for Hermitian matrices) is the classical text [5], background on the
Grassmannian (e.g., principal angles, Hausdorff distance) useful for our particular work is easily
accessible in |S], and background on principal components analysis (henceforth "PCA") can be
found in [T]. A classical and broad textbook on the Grassmannian is [2].

Suppose that two scientists each take daily measurements of $m$ features of a random process,
where $m$ is a large, positive integer. For each day $i = 1, 2, 3, \ldots$, the first scientist records her daily
measurements as $X^{(i)} \in \mathbb{R}^m$, where $X^{(i)}_j$ is her measurement of the $j$th feature, and the second
scientist records his daily measurements as $Y^{(i)} \in \mathbb{R}^m$, where $Y^{(i)}_j$ is his measurement of the $j$th
feature, for $j = 1, 2, \ldots, m$. Although the two scientists want to record the same process, suppose
that their measurements are made with some error, which we model in the following manner.

There are three collections of random variables $\{Z^{(i)}_j\}$, $\{Z'^{(i)}_j\}$, and $\{Z''^{(i)}_j\}$, each over indices
$i = 1, 2, 3, \ldots$ and $j = 1, 2, \ldots, m$, such that these random variables are all collectively independent
and identically distributed, and their common distribution has finite variance $\alpha > 0$. Suppose
that the random variables $\{Z^{(i)}_j\}$ are the signal feature values associated with the process that the
scientists would like to record, and the random variables $\{Z'^{(i)}_j\}$ and $\{Z''^{(i)}_j\}$ are confounding noise.
Let a real-valued "measurement-accuracy" parameter $\gamma$ be fixed in the interval $[0, 1]$. One scenario
is that for each day $i = 1, 2, 3, \ldots$ and feature $j = 1, 2, \ldots, m$, the first scientist's measurement
$X^{(i)}_j$ is a mixture of $Z^{(i)}_j$ and $Z'^{(i)}_j$ with respective probabilities $\gamma$ and $1 - \gamma$, and the second
scientist's measurement $Y^{(i)}_j$ is a mixture of $Z^{(i)}_j$ and $Z''^{(i)}_j$ with respective probabilities $\gamma$ and
$1 - \gamma$. A second scenario is that, instead, for each day $i = 1, 2, 3, \ldots$ and feature $j = 1, 2, \ldots, m$,
$X^{(i)}_j = \gamma \cdot Z^{(i)}_j + \sqrt{1 - \gamma^2} \cdot Z'^{(i)}_j$ and $Y^{(i)}_j = \gamma \cdot Z^{(i)}_j + \sqrt{1 - \gamma^2} \cdot Z''^{(i)}_j$. The main result of Section 2 is
Theorem 1, which will hold in either of these two scenarios. At one extreme, if $\gamma = 1$, then the two
scientists' measurements are perfectly accurate and $X^{(i)} = Y^{(i)}$ for all $i$. At the other extreme, if
$\gamma = 0$, then the two scientists' measurements are independent of each other.

For each positive integer $n$, denote by $\mathbf{X}^{(n)}$ the matrix $[X^{(1)}|X^{(2)}|\cdots|X^{(n)}] \in \mathbb{R}^{m \times n}$
consisting of the first scientist's measurements over the first $n$ days, and denote by $\mathbf{Y}^{(n)}$ the matrix
$[Y^{(1)}|Y^{(2)}|\cdots|Y^{(n)}] \in \mathbb{R}^{m \times n}$ consisting of the second scientist's measurements over the first $n$ days.

Because the measurement vectors are in the high-dimensional space $\mathbb{R}^m$, suppose the scientists
project their respective measurement vectors to the lower-dimensional space $\mathbb{R}^k$ for some smaller,
positive integer $k$. This is done in the following manner. Let $H_n = I_n - \frac{1}{n} J_n$ denote the centering
matrix ($I_n$ and $J_n$ are, respectively, the $n \times n$ identity matrix and the $n \times n$ matrix of all ones). Suppose
that the first scientist chooses a sequence $\mathcal{A}^{(1)}, \mathcal{A}^{(2)}, \mathcal{A}^{(3)}, \ldots$ of random (or deterministic) elements
of the Grassmannian $\mathcal{G}_{k,m}$ (the space of all $k$-dimensional subspaces of $\mathbb{R}^m$), and suppose that the
second scientist chooses a sequence $\mathcal{B}^{(1)}, \mathcal{B}^{(2)}, \mathcal{B}^{(3)}, \ldots$ of random (or deterministic) elements of
the Grassmannian $\mathcal{G}_{k,m}$. No assumptions are made on the distributions of these elements of the
Grassmannian or on their dependence/independence, but one example of interest is where, for
$n = 1, 2, 3, \ldots$, $\mathcal{A}^{(n)}, \mathcal{B}^{(n)} \in \mathcal{G}_{k,m}$ denote the respective $k$-dimensional subspaces to which principal
components analysis (PCA) projects $\mathbf{X}^{(n)} H_n$ and $\mathbf{Y}^{(n)} H_n$, respectively. Let $P_{\mathcal{A}^{(n)}}$ denote the
projection operator from $\mathbb{R}^m$ onto $\mathcal{A}^{(n)}$, and similarly $P_{\mathcal{B}^{(n)}}$ for $\mathcal{B}^{(n)}$. On each day $n$, the first scientist reports the scaled matrix

$$\mathcal{X}^{(n)} := \frac{\sqrt{k}}{\|P_{\mathcal{A}^{(n)}} \mathbf{X}^{(n)} H_n\|_F} \, P_{\mathcal{A}^{(n)}} \mathbf{X}^{(n)} H_n \in \mathbb{R}^{k \times n}$$

to the Governing Board of Scientists, and the second scientist reports the scaled matrix

$$\mathcal{Y}^{(n)} := \frac{\sqrt{k}}{\|P_{\mathcal{B}^{(n)}} \mathbf{Y}^{(n)} H_n\|_F} \, P_{\mathcal{B}^{(n)}} \mathbf{Y}^{(n)} H_n \in \mathbb{R}^{k \times n}$$

to the Governing Board of Scientists. (Nota Bene: The specific choice of $\sqrt{k}$ in the scaling $\|\mathcal{X}^{(n)}\|_F = \|\mathcal{Y}^{(n)}\|_F = \sqrt{k}$
is an innocuous notational convenience.)

Now, the Governing Board of Scientists wants to perform its own check that the two scientists
are indeed taking measurements reflecting the same process. So the Governing Board of Scientists
computes the Procrustean fitting-error $\epsilon(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)}) := \min_{Q \in \mathbb{R}^{k \times k}:\, Q^T Q = I_k} \|Q \mathcal{X}^{(n)} - \mathcal{Y}^{(n)}\|_F$. It will
later be seen (from (5)) that the square Procrustean fitting-error satisfies $0 \le \epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)}) \le 2k$;
the Governing Board of Scientists reasons that this square Procrustean fitting-error should be
small (negligible compared to $2k$) if indeed $\gamma$ is close to 1. Is this reasoning valid?

In the following, $d(\cdot, \cdot)$ denotes the Hausdorff distance on the Grassmannian $\mathcal{G}_{k,m}$; in particular,
for any $\mathcal{A}, \mathcal{B} \in \mathcal{G}_{k,m}$, $d(\mathcal{A}, \mathcal{B}) = \sqrt{\sum_{i=1}^{k} \left(2 \sin \frac{\theta_i(\mathcal{A}, \mathcal{B})}{2}\right)^2}$, where $\{\theta_i(\mathcal{A}, \mathcal{B})\}_{i=1}^{k}$ are the principal angles
between $\mathcal{A}$ and $\mathcal{B}$. Note that the square Hausdorff distance satisfies $0 \le d^2(\mathcal{A}, \mathcal{B}) \le 2k$.
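
As an illustration of the definition above, the following is a minimal NumPy sketch of the square Hausdorff distance computed from principal angles. It assumes each subspace is supplied as an m-by-k matrix with orthonormal columns (so that the cosines of the principal angles are the singular values of the product of one basis transposed with the other); the function name `hausdorff_sq` is ours, not the manuscript's.

```python
import numpy as np

def hausdorff_sq(A, B):
    """Square Hausdorff distance between span(A) and span(B).

    A, B: m-by-k matrices with orthonormal columns (assumed input format).
    The cosines of the principal angles are the singular values of A^T B.
    """
    cosines = np.clip(np.linalg.svd(A.T @ B, compute_uv=False), 0.0, 1.0)
    thetas = np.arccos(cosines)
    return float(np.sum((2.0 * np.sin(thetas / 2.0)) ** 2))

m, k = 6, 2
E = np.eye(m)
same = hausdorff_sq(E[:, :2], E[:, :2])         # identical subspaces
orthogonal = hausdorff_sq(E[:, :2], E[:, 2:4])  # mutually orthogonal subspaces
print(same, orthogonal)
```

Identical subspaces give the minimum value 0, and mutually orthogonal subspaces give the maximum value 2k, matching the stated range.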



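Theorem 1. Almost surely,

$$\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)}) - \left[ (1 - \gamma^2) \cdot 2k + \gamma^2 \cdot d^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)}) \right] \to 0$$

as $n \to \infty$.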



The proof of Theorem 1 is given later, in Section 3, as a special case of the more general
Theorem 2.

Theorem 1 says that $\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})$ asymptotically becomes this convex combination (via $\gamma^2$)
of $2k$ and $d^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)})$; in particular, if $\gamma$ is close to 0, which implies that the two scientists'
measurements are close to being independent of each other, then indeed $\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})$ is close to its maximum
possible value $2k$, but if $\gamma$ is close to 1, meaning that the scientists' measurements are close to
being the same as each other, we then have $\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})$ close to $d^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)})$. Is this square
Hausdorff distance close to zero when $\gamma$ is close to 1?

In Section 4 we show that, in fact, if the principal components analysis projections are used
then this may not be the case, and the square Hausdorff distance $d^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)})$ might even be
close to its maximum possible value of $2k$. By contrast, if the two scientists both used the
simple-minded projection consisting of just taking the first $k$ coordinates of $\mathbb{R}^m$ and ignoring the rest of
the coordinates, then $d^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)}) = 0$, in which case $\gamma$ close to 1 would indeed yield $\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})$
close to 0.
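
The Tale of Two Scientists can be sketched numerically. The following minimal NumPy simulation is an illustration only: it assumes the second scenario of this section with standard normal $Z$, $Z'$, $Z''$ (so the common variance is 1), uses PCA subspaces, and works in the coordinates of the projected subspaces. All function names, the seed, and the parameter values are our own choices. It compares the square Procrustean fitting-error with the convex combination in Theorem 1.

```python
import numpy as np

def pca_subspace(D, k):
    """Orthonormal basis (m-by-k) of the top-k PCA subspace of m-by-n data D."""
    Dc = D - D.mean(axis=1, keepdims=True)       # centering, same as D @ H_n
    U, _, _ = np.linalg.svd(Dc, full_matrices=False)
    return U[:, :k]

def scaled_projection(D, basis):
    """Project centered data onto span(basis); scale to Frobenius norm sqrt(k)."""
    k = basis.shape[1]
    Dc = D - D.mean(axis=1, keepdims=True)
    P = basis.T @ Dc                             # k-by-n coordinates
    return np.sqrt(k) * P / np.linalg.norm(P)

def procrustes_sq(X, Y):
    """Square Procrustean fitting-error via singular values of Y X^T."""
    s = np.linalg.svd(Y @ X.T, compute_uv=False)
    return float(np.sum(X**2) + np.sum(Y**2) - 2.0 * s.sum())

def hausdorff_sq(A, B):
    """Square Hausdorff distance from cosines of principal angles."""
    c = np.clip(np.linalg.svd(A.T @ B, compute_uv=False), 0.0, 1.0)
    return float(np.sum(2.0 * (1.0 - c)))

rng = np.random.default_rng(1)
m, k, n, gamma = 6, 2, 20000, 0.9                # illustrative values (assumed)
Z, Z1, Z2 = (rng.standard_normal((m, n)) for _ in range(3))
X = gamma * Z + np.sqrt(1 - gamma**2) * Z1       # first scientist
Y = gamma * Z + np.sqrt(1 - gamma**2) * Z2       # second scientist

A, B = pca_subspace(X, k), pca_subspace(Y, k)
eps2 = procrustes_sq(scaled_projection(X, A), scaled_projection(Y, B))
d2 = hausdorff_sq(A, B)
predicted = (1 - gamma**2) * 2 * k + gamma**2 * d2
print(eps2, predicted)
```

For large n the two printed numbers should nearly agree, even when the Hausdorff term is far from zero; replacing the PCA bases by the first k standard-basis vectors makes the Hausdorff term vanish.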



3 The asymptotic relationship between Procrustean fitting-error and the projection distance

The main result of this section is the statement and proof of Theorem 2. We begin with a
description of a general setting and a list of basic facts that will be used in the proof of Theorem 2.



3.1 Preliminaries and the general setting 

From this point on, we will consider a much more general setting than the idealized setting of
Section 2. Suppose now that $X^{(1)}, X^{(2)}, X^{(3)}, \ldots \in \mathbb{R}^m$ and $Y^{(1)}, Y^{(2)}, Y^{(3)}, \ldots \in \mathbb{R}^m$ are random
vectors (for convenience, let us denote $X = X^{(1)}$, $Y = Y^{(1)}$) such that the stacked random vectors
$\begin{bmatrix} X^{(1)} \\ Y^{(1)} \end{bmatrix}, \begin{bmatrix} X^{(2)} \\ Y^{(2)} \end{bmatrix}, \begin{bmatrix} X^{(3)} \\ Y^{(3)} \end{bmatrix}, \ldots \in \mathbb{R}^{2m}$ are independent, identically distributed, with covariance matrix

$$\mathrm{Cov} \begin{bmatrix} X \\ Y \end{bmatrix} = \begin{bmatrix} \mathrm{Cov}(X) & \mathrm{Cov}(X, Y) \\ \mathrm{Cov}(Y, X) & \mathrm{Cov}(Y) \end{bmatrix} \in \mathbb{R}^{2m \times 2m}.$$

(We no longer require, in the manner of Section 2, that $X$ and $Y$ have independent or identically
distributed components, nor that they arise as a mixture of other random variables in any
particular way.) Assume that $\mathrm{Cov}(X)$ and $\mathrm{Cov}(Y)$ are both nonzero matrices.

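Then define, for each positive integer $n$, random matrices $\mathbf{X}^{(n)} := [X^{(1)}|X^{(2)}|\cdots|X^{(n)}] \in \mathbb{R}^{m \times n}$
and $\mathbf{Y}^{(n)} := [Y^{(1)}|Y^{(2)}|\cdots|Y^{(n)}] \in \mathbb{R}^{m \times n}$. Let $\mathcal{A}^{(n)}, \mathcal{B}^{(n)} \in \mathcal{G}_{k,m}$ denote the respective $k$-dimensional
subspaces to which principal components analysis (PCA) projects $\mathbf{X}^{(n)} H_n$ and $\mathbf{Y}^{(n)} H_n$. In the
special case where $\mathrm{Cov}(X)$ and $\mathrm{Cov}(Y)$ are scalar multiples of $I_m$, we will explicitly allow
$\{\mathcal{A}^{(n)}\}_{n=1}^{\infty}$, $\{\mathcal{B}^{(n)}\}_{n=1}^{\infty}$ to be any sequences of elements in $\mathcal{G}_{k,m}$ whatsoever, deterministic or random.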

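It is useful to consider the projections $P_{\mathcal{A}^{(n)}}$ and $P_{\mathcal{B}^{(n)}}$ as $m \times m$ symmetric, idempotent matrices
(i.e., keep the ambient coordinate system for the projection's range) and, for each $n = 1, 2, \ldots$,
define

$$\mathcal{X}^{(n)} := \frac{\sqrt{k}}{\|P_{\mathcal{A}^{(n)}} \mathbf{X}^{(n)} H_n\|_F} \, P_{\mathcal{A}^{(n)}} \mathbf{X}^{(n)} H_n \in \mathbb{R}^{m \times n} \quad \text{and} \quad \mathcal{Y}^{(n)} := \frac{\sqrt{k}}{\|P_{\mathcal{B}^{(n)}} \mathbf{Y}^{(n)} H_n\|_F} \, P_{\mathcal{B}^{(n)}} \mathbf{Y}^{(n)} H_n \in \mathbb{R}^{m \times n}.$$

(There is no difference for our results and for the Procrustean fitting-error if, as in Section 2, we
instead treated $P_{\mathcal{A}^{(n)}}$ and $P_{\mathcal{B}^{(n)}}$ as functions $\mathbb{R}^m \to \mathbb{R}^k$ with the coordinate systems of $\mathcal{A}^{(n)}$ and
$\mathcal{B}^{(n)}$, respectively, in which case we have $\mathcal{X}^{(n)}$ and $\mathcal{Y}^{(n)}$ in $\mathbb{R}^{k \times n}$ instead of $\mathbb{R}^{m \times n}$.)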

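For any matrix $C \in \mathbb{R}^{m \times m}$ with only real-valued eigenvalues (e.g., symmetric matrices), let
$\lambda_1(C) \ge \lambda_2(C) \ge \cdots \ge \lambda_m(C)$ denote the eigenvalues of $C$. For any matrix $C \in \mathbb{R}^{m \times n}$, let
$\sigma_1(C) \ge \sigma_2(C) \ge \cdots \ge \sigma_{\min\{m,n\}}(C)$ denote the singular values of $C$. Recall that if $C$ is symmetric
and positive semidefinite (e.g., a covariance matrix) then $\lambda_i(C) = \sigma_i(C)$ for all $i = 1, 2, \ldots, m$,
and recall that, for any $C \in \mathbb{R}^{m \times n}$, $D \in \mathbb{R}^{n \times m}$, the nonzero eigenvalues of $CD$ are the same as
the nonzero eigenvalues of $DC$. For any $\mathcal{A}, \mathcal{B} \in \mathcal{G}_{k,m}$ (with projection matrices $P_{\mathcal{A}}$, $P_{\mathcal{B}}$) and all
$i = 1, 2, \ldots, m$, we thus have $\sigma_i^2(P_{\mathcal{A}} P_{\mathcal{B}}) = \lambda_i(P_{\mathcal{A}} P_{\mathcal{B}} P_{\mathcal{B}}^T P_{\mathcal{A}}^T) = \lambda_i(P_{\mathcal{A}} P_{\mathcal{B}} P_{\mathcal{B}} P_{\mathcal{A}}) = \lambda_i(P_{\mathcal{A}} P_{\mathcal{A}} P_{\mathcal{B}} P_{\mathcal{B}}) =
\lambda_i(P_{\mathcal{A}} P_{\mathcal{B}})$. In fact, $P_{\mathcal{A}} P_{\mathcal{B}}$ has at most $k$ positive eigenvalues and at most $k$ positive singular
values (the rest of the eigenvalues and the rest of the singular values are all zero) and, for all
$i = 1, 2, \ldots, k$,

$$\sigma_i(P_{\mathcal{A}} P_{\mathcal{B}}) = \sqrt{\lambda_i(P_{\mathcal{A}} P_{\mathcal{B}})} = \cos \theta_i(\mathcal{A}, \mathcal{B}), \qquad (1)$$

where $\{\theta_i(\mathcal{A}, \mathcal{B})\}_{i=1}^{k}$ are the principal angles between $\mathcal{A}$ and $\mathcal{B}$.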

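For each $n = 1, 2, \ldots$, the Hausdorff distance $d(\mathcal{A}^{(n)}, \mathcal{B}^{(n)})$ is the nonnegative square root of

$$d^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)}) := \sum_{i=1}^{k} \left( 2 \sin \frac{\theta_i(\mathcal{A}^{(n)}, \mathcal{B}^{(n)})}{2} \right)^2 = \sum_{i=1}^{k} 2 \left( 1 - \cos \theta_i(\mathcal{A}^{(n)}, \mathcal{B}^{(n)}) \right). \qquad (2)$$

It is clear that $0 \le d^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)}) \le 2k$. We also define, for each $n = 1, 2, \ldots$, the quantity

$$\mathfrak{d}^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)}) := \sum_{i=1}^{k} 2 \left( 1 - \frac{k}{\sum_{j=1}^{k} \sigma_j(\mathrm{Cov}(X, Y))} \, \sigma_i\!\left( P_{\mathcal{A}^{(n)}} \mathrm{Cov}(X, Y) P_{\mathcal{B}^{(n)}} \right) \right). \qquad (3)$$

Later, in Proposition 6, we will prove it always holds that $0 \le \mathfrak{d}^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)}) \le 2k$. Note that if
$\mathrm{Cov}(X, Y)$ is a nonzero scalar multiple of $I_m$ then, by (1) and (2), $\mathfrak{d}^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)})$ is equal to $d^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)})$ and, in
fact, if $\mathrm{Cov}(X, Y)$ is the zero matrix then we will define $\mathfrak{d}^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)}) := d^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)})$ (because,
indeed, $\frac{k}{\sum_{j=1}^{k} \sigma_j(\mathrm{Cov}(X, Y))}$ is not defined). For this reason, we like to view $\mathfrak{d}^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)})$ as a
weighted form of the square Hausdorff distance.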
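
To make the weighted distance concrete, here is a minimal NumPy sketch, under the assumption that the subspaces are given through their m-by-m projection matrices and that the cross-covariance is nonzero. The helper names and the random test subspaces are ours. It evaluates the weighted quantity and the ordinary square Hausdorff distance, and compares them when the cross-covariance is a scalar multiple of the identity.

```python
import numpy as np

def weighted_dist_sq(PA, PB, C, k):
    """Weighted square distance from projection matrices PA, PB and C = Cov(X, Y).

    Assumes C is nonzero, so that the sum of its top-k singular values is positive.
    """
    top = np.linalg.svd(C, compute_uv=False)[:k].sum()
    s = np.linalg.svd(PA @ C @ PB, compute_uv=False)[:k]
    return float(np.sum(2.0 * (1.0 - (k / top) * s)))

def hausdorff_sq(PA, PB, k):
    """Square Hausdorff distance; cosines are the top-k singular values of PA PB."""
    c = np.linalg.svd(PA @ PB, compute_uv=False)[:k]
    return float(np.sum(2.0 * (1.0 - c)))

rng = np.random.default_rng(2)
m, k = 6, 2
QA, _ = np.linalg.qr(rng.standard_normal((m, k)))   # random orthonormal bases
QB, _ = np.linalg.qr(rng.standard_normal((m, k)))
PA, PB = QA @ QA.T, QB @ QB.T

w = weighted_dist_sq(PA, PB, 0.6 * np.eye(m), k)    # Cov(X, Y) = 0.6 * I (assumed)
d = hausdorff_sq(PA, PB, k)
print(w, d)
```

In this scalar-multiple case the two printed values coincide up to rounding, which is the observation made in the text.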

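For each $n = 1, 2, \ldots$, the Procrustean fitting-error is defined to be

$$\epsilon(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)}) := \min_{Q \in \mathbb{R}^{m \times m}:\, Q^T Q = I_m} \|Q \mathcal{X}^{(n)} - \mathcal{Y}^{(n)}\|_F. \qquad (4)$$

In fact, it holds that

$$\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)}) = \|\mathcal{X}^{(n)}\|_F^2 + \|\mathcal{Y}^{(n)}\|_F^2 - 2 \sum_{i=1}^{m} \sigma_i\!\left( \mathcal{Y}^{(n)} \mathcal{X}^{(n)T} \right) = 2k - 2 \sum_{i=1}^{m} \sigma_i\!\left( \mathcal{Y}^{(n)} \mathcal{X}^{(n)T} \right). \qquad (5)$$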
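
The identity (5) can be checked numerically. The sketch below is an illustration with random matrices of our own choosing (not the manuscript's data); it computes the square fitting-error from the singular values and compares it with a direct evaluation at the standard minimizer built from the singular value decomposition of the product of one matrix with the transpose of the other.

```python
import numpy as np

def procrustes_sq_error(X, Y):
    """min over orthogonal Q of ||Q X - Y||_F^2, via singular values of Y X^T."""
    s = np.linalg.svd(Y @ X.T, compute_uv=False)
    return float(np.sum(X**2) + np.sum(Y**2) - 2.0 * np.sum(s))

def optimal_rotation(X, Y):
    """An orthogonal minimizer: Q = U V^T, where Y X^T = U S V^T."""
    U, _, Vt = np.linalg.svd(Y @ X.T)
    return U @ Vt

rng = np.random.default_rng(0)
k, n = 2, 50
X = rng.standard_normal((k, n))
Y = rng.standard_normal((k, n))
# Normalize so that both Frobenius norms equal sqrt(k), as in the text.
X *= np.sqrt(k) / np.linalg.norm(X)
Y *= np.sqrt(k) / np.linalg.norm(Y)

err_formula = procrustes_sq_error(X, Y)
Q = optimal_rotation(X, Y)
err_direct = float(np.linalg.norm(Q @ X - Y) ** 2)
print(err_formula, err_direct)
```

With the normalization used here, the first two terms of the formula sum to 2k, so the result lies in the interval from 0 to 2k.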



3.2 The result 

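Within the setting of Section 3.1, we now state and prove the main result of Section 3.

Theorem 2. In the setting of Section 3.1, it holds almost surely that

$$\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)}) - \left[ (1 - \rho) \cdot 2k + \rho \cdot \mathfrak{d}^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)}) \right] \to 0$$

as $n \to \infty$, where $\rho$ is defined as

$$\rho := \frac{\sum_{j=1}^{k} \sigma_j(\mathrm{Cov}(X, Y))}{\sqrt{\sum_{i=1}^{k} \sigma_i(\mathrm{Cov}(X))} \, \sqrt{\sum_{i=1}^{k} \sigma_i(\mathrm{Cov}(Y))}}.$$

In Proposition 7 we prove that $0 \le \rho \le 1$. To prove Theorem 2 we first establish Lemmas 3 and 4.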
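Lemma 3. Almost surely, $\mathrm{trace}\!\left( \frac{1}{n-1} P_{\mathcal{A}^{(n)}} \mathbf{X}^{(n)} H_n H_n^T \mathbf{X}^{(n)T} P_{\mathcal{A}^{(n)}}^T \right) \to \sum_{i=1}^{k} \sigma_i(\mathrm{Cov}(X))$ as $n \to \infty$.

Proof of Lemma 3. For each $n = 1, 2, 3, \ldots$, let us consider a singular value decomposition

$$\mathbf{X}^{(n)} H_n = U^{(n)} \Lambda^{(n)} V^{(n)T},$$

where $U^{(n)} \in \mathbb{R}^{m \times m}$ is orthogonal, $\Lambda^{(n)} \in \mathbb{R}^{m \times n}$ is a "diagonal" matrix, with nonnegative diagonals
non-increasing along its main diagonal, and $V^{(n)} \in \mathbb{R}^{n \times n}$ is orthogonal. By the definition of PCA,

$$P_{\mathcal{A}^{(n)}} = U^{(n)} E \, U^{(n)T},$$

where $E \in \mathbb{R}^{m \times m}$ is the diagonal matrix with its first $k$ diagonals 1 and its remaining diagonals 0.
Thus, the matrix

$$\frac{1}{n-1} \mathbf{X}^{(n)} H_n H_n^T \mathbf{X}^{(n)T}$$

and the matrix

$$\frac{1}{n-1} P_{\mathcal{A}^{(n)}} \mathbf{X}^{(n)} H_n H_n^T \mathbf{X}^{(n)T} P_{\mathcal{A}^{(n)}}^T$$

share their $k$ largest eigenvalues, and the remaining $m - k$ eigenvalues of the latter matrix are 0.
By the Strong Law of Large Numbers, almost surely $\frac{1}{n-1} \mathbf{X}^{(n)} H_n H_n^T \mathbf{X}^{(n)T} \to \mathrm{Cov}(X)$, hence we
have $\mathrm{trace}\!\left( \frac{1}{n-1} P_{\mathcal{A}^{(n)}} \mathbf{X}^{(n)} H_n H_n^T \mathbf{X}^{(n)T} P_{\mathcal{A}^{(n)}}^T \right) \to \sum_{i=1}^{k} \lambda_i(\mathrm{Cov}(X)) = \sum_{i=1}^{k} \sigma_i(\mathrm{Cov}(X))$ as $n \to \infty$.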

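Lastly, if $\mathrm{Cov}(X) = \alpha \cdot I_m$ for some $\alpha > 0$ then recall that we explicitly allow $\{\mathcal{A}^{(n)}\}_{n=1}^{\infty}$ to be
any elements of $\mathcal{G}_{k,m}$; in this case note that, by the boundedness of $\{P_{\mathcal{A}^{(n)}}\}_{n=1}^{\infty}$ and the Strong Law
of Large Numbers, as $n \to \infty$,

$$\mathrm{trace}\!\left( \frac{1}{n-1} P_{\mathcal{A}^{(n)}} \mathbf{X}^{(n)} H_n H_n^T \mathbf{X}^{(n)T} P_{\mathcal{A}^{(n)}}^T \right) = \alpha \cdot \mathrm{trace}\, P_{\mathcal{A}^{(n)}} + \mathrm{trace}\, P_{\mathcal{A}^{(n)}} \left( \frac{1}{n-1} \mathbf{X}^{(n)} H_n H_n^T \mathbf{X}^{(n)T} - \alpha \cdot I_m \right) P_{\mathcal{A}^{(n)}}^T \to \alpha k = \sum_{i=1}^{k} \sigma_i(\mathrm{Cov}(X)),$$

as desired. □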
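
Lemma 3 can be illustrated with a small simulation. The sketch below is only an illustration: it assumes Gaussian data with a diagonal covariance whose eigenvalues are chosen by us, forms the PCA projection from the sample covariance, and compares the trace in the lemma with the sum of the top k eigenvalues of the true covariance.

```python
import numpy as np

rng = np.random.default_rng(5)
m, k, n = 6, 2, 50000
eigs = np.array([3.0, 2.0, 1.0, 1.0, 1.0, 1.0])   # eigenvalues of a hypothetical Cov(X)
X = np.sqrt(eigs)[:, None] * rng.standard_normal((m, n))

Xc = X - X.mean(axis=1, keepdims=True)            # centered data, i.e. X H_n
S = Xc @ Xc.T / (n - 1)                           # sample covariance matrix
evals, evecs = np.linalg.eigh(S)                  # eigenvalues in ascending order
U = evecs[:, -k:]                                 # top-k eigenvectors span the PCA subspace
P = U @ U.T                                       # symmetric idempotent projection matrix

lhs = float(np.trace(P @ S @ P.T))                # trace of the projected sample covariance
rhs = float(np.sort(eigs)[::-1][:k].sum())        # sum of the top-k eigenvalues of Cov(X)
print(lhs, rhs)
```

For large n the two printed values are close, as the lemma asserts in the limit.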

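Lemma 4. For $i = 1, 2, \ldots, m$, almost surely

$$\sigma_i^2\!\left( \mathcal{Y}^{(n)} \mathcal{X}^{(n)T} \right) - \delta \cdot \sigma_i^2\!\left( P_{\mathcal{A}^{(n)}} \mathrm{Cov}(X, Y) P_{\mathcal{B}^{(n)}} \right) \to 0$$

as $n \to \infty$, where $\delta := \frac{k^2}{\sum_{i=1}^{k} \sigma_i(\mathrm{Cov}(X)) \cdot \sum_{i=1}^{k} \sigma_i(\mathrm{Cov}(Y))}$.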

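Proof of Lemma 4. For each $n = 1, 2, \ldots$, expand the expression $\mathcal{Y}^{(n)} \mathcal{X}^{(n)T} (\mathcal{Y}^{(n)} \mathcal{X}^{(n)T})^T$ by the
definitions to write it as $\mathcal{Y}^{(n)} \mathcal{X}^{(n)T} (\mathcal{Y}^{(n)} \mathcal{X}^{(n)T})^T = \phi^{(n)} \cdot \Phi^{(n)}$, where $\phi^{(n)}$ and $\Phi^{(n)}$ are defined by

$$\phi^{(n)} := \frac{k^2}{\mathrm{trace}\!\left( \frac{1}{n-1} P_{\mathcal{B}^{(n)}} \mathbf{Y}^{(n)} H_n H_n^T \mathbf{Y}^{(n)T} P_{\mathcal{B}^{(n)}}^T \right) \cdot \mathrm{trace}\!\left( \frac{1}{n-1} P_{\mathcal{A}^{(n)}} \mathbf{X}^{(n)} H_n H_n^T \mathbf{X}^{(n)T} P_{\mathcal{A}^{(n)}}^T \right)}$$

and

$$\Phi^{(n)} := P_{\mathcal{B}^{(n)}} \left( \frac{1}{n-1} \mathbf{Y}^{(n)} H_n H_n^T \mathbf{X}^{(n)T} \right) P_{\mathcal{A}^{(n)}}^T P_{\mathcal{A}^{(n)}} \left( \frac{1}{n-1} \mathbf{X}^{(n)} H_n H_n^T \mathbf{Y}^{(n)T} \right) P_{\mathcal{B}^{(n)}}^T \in \mathbb{R}^{m \times m}.$$

Define $E^{(n)}_{X,Y} := \frac{1}{n-1} \mathbf{X}^{(n)} H_n H_n^T \mathbf{Y}^{(n)T} - \mathrm{Cov}(X, Y)$; by the Strong Law of Large Numbers, almost
surely $\|E^{(n)}_{X,Y}\|_F \to 0$ as $n \to \infty$. Thus, by the subadditivity and submultiplicativity of the norm, and
by the boundedness of $\{P_{\mathcal{A}^{(n)}}\}_{n=1}^{\infty}$ and $\{P_{\mathcal{B}^{(n)}}\}_{n=1}^{\infty}$, we have almost surely that

$$\left\| \Phi^{(n)} - P_{\mathcal{B}^{(n)}} \mathrm{Cov}^T(X, Y) P_{\mathcal{A}^{(n)}}^T P_{\mathcal{A}^{(n)}} \mathrm{Cov}(X, Y) P_{\mathcal{B}^{(n)}}^T \right\|_F = \Big\| P_{\mathcal{B}^{(n)}} E^{(n)T}_{X,Y} P_{\mathcal{A}^{(n)}}^T P_{\mathcal{A}^{(n)}} E^{(n)}_{X,Y} P_{\mathcal{B}^{(n)}}^T$$
$$\qquad + \, P_{\mathcal{B}^{(n)}} E^{(n)T}_{X,Y} P_{\mathcal{A}^{(n)}}^T P_{\mathcal{A}^{(n)}} \mathrm{Cov}(X, Y) P_{\mathcal{B}^{(n)}}^T$$
$$\qquad + \, P_{\mathcal{B}^{(n)}} \mathrm{Cov}^T(X, Y) P_{\mathcal{A}^{(n)}}^T P_{\mathcal{A}^{(n)}} E^{(n)}_{X,Y} P_{\mathcal{B}^{(n)}}^T \Big\|_F \to 0 \qquad (6)$$

as $n \to \infty$. Now, by Lemma 3 and the definition of $\phi^{(n)}$, almost surely $\phi^{(n)} \to \delta$ as $n \to \infty$, hence
by (6) and the boundedness of $\{P_{\mathcal{A}^{(n)}}\}_{n=1}^{\infty}$ and $\{P_{\mathcal{B}^{(n)}}\}_{n=1}^{\infty}$, we have almost surely that

$$\left\| \phi^{(n)} \cdot \Phi^{(n)} - \delta \cdot P_{\mathcal{B}^{(n)}} \mathrm{Cov}^T(X, Y) P_{\mathcal{A}^{(n)}}^T P_{\mathcal{A}^{(n)}} \mathrm{Cov}(X, Y) P_{\mathcal{B}^{(n)}}^T \right\|_F$$
$$\le \left\| \phi^{(n)} \cdot \left( \Phi^{(n)} - P_{\mathcal{B}^{(n)}} \mathrm{Cov}^T(X, Y) P_{\mathcal{A}^{(n)}}^T P_{\mathcal{A}^{(n)}} \mathrm{Cov}(X, Y) P_{\mathcal{B}^{(n)}}^T \right) \right\|_F$$
$$\quad + \left\| (\phi^{(n)} - \delta) \cdot P_{\mathcal{B}^{(n)}} \mathrm{Cov}^T(X, Y) P_{\mathcal{A}^{(n)}}^T P_{\mathcal{A}^{(n)}} \mathrm{Cov}(X, Y) P_{\mathcal{B}^{(n)}}^T \right\|_F \to 0$$

as $n \to \infty$. Thus, by Weyl's Theorem for Hermitian matrices, for each $i = 1, 2, \ldots, m$ we have
almost surely that

$$\lambda_i\!\left( \phi^{(n)} \cdot \Phi^{(n)} \right) - \lambda_i\!\left( \delta \cdot \left( P_{\mathcal{A}^{(n)}} \mathrm{Cov}(X, Y) P_{\mathcal{B}^{(n)}}^T \right)^T P_{\mathcal{A}^{(n)}} \mathrm{Cov}(X, Y) P_{\mathcal{B}^{(n)}}^T \right) \to 0 \qquad (7)$$

as $n \to \infty$, from which Lemma 4 follows, after noting that $P_{\mathcal{B}^{(n)}}$ is symmetric. □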

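We are now able to prove the main result of this section, Theorem 2.

Proof of Theorem 2. Let $\delta$ be as defined in Lemma 4. Note that for any nonnegative, bounded
real sequences $\{a^{(n)}\}_{n=1}^{\infty}$ and $\{b^{(n)}\}_{n=1}^{\infty}$, it holds¹ that $a^{(n)} - b^{(n)} \to 0$ if and only if $\sqrt{a^{(n)}} - \sqrt{b^{(n)}} \to 0$,
as $n \to \infty$. Thus, by Lemma 4 and noting that the rank of $P_{\mathcal{A}^{(n)}} \mathrm{Cov}(X, Y) P_{\mathcal{B}^{(n)}}$ is at most $k$, we
have almost surely that, as $n \to \infty$,

$$\sum_{i=1}^{m} \sigma_i\!\left( \mathcal{Y}^{(n)} \mathcal{X}^{(n)T} \right) - \sqrt{\delta} \cdot \sum_{i=1}^{k} \sigma_i\!\left( P_{\mathcal{A}^{(n)}} \mathrm{Cov}(X, Y) P_{\mathcal{B}^{(n)}} \right) \to 0. \qquad (8)$$

But the expression in (8), multiplied by $-2$, can be simplified, by (5) and (3), as

$$\left[ 2k - 2 \sum_{i=1}^{m} \sigma_i\!\left( \mathcal{Y}^{(n)} \mathcal{X}^{(n)T} \right) \right] - \left[ 2k - 2 \sqrt{\delta} \sum_{i=1}^{k} \sigma_i\!\left( P_{\mathcal{A}^{(n)}} \mathrm{Cov}(X, Y) P_{\mathcal{B}^{(n)}} \right) \right]$$
$$= \epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)}) - \left[ 2k - 2 \rho \sum_{i=1}^{k} \frac{k}{\sum_{j=1}^{k} \sigma_j(\mathrm{Cov}(X, Y))} \, \sigma_i\!\left( P_{\mathcal{A}^{(n)}} \mathrm{Cov}(X, Y) P_{\mathcal{B}^{(n)}} \right) \right]$$
$$= \epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)}) - \left[ (1 - \rho) \cdot 2k + \rho \cdot \sum_{i=1}^{k} 2 \left( 1 - \frac{k}{\sum_{j=1}^{k} \sigma_j(\mathrm{Cov}(X, Y))} \, \sigma_i\!\left( P_{\mathcal{A}^{(n)}} \mathrm{Cov}(X, Y) P_{\mathcal{B}^{(n)}} \right) \right) \right]$$
$$= \epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)}) - \left[ (1 - \rho) \cdot 2k + \rho \cdot \mathfrak{d}^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)}) \right],$$

which establishes Theorem 2. □

¹Indeed, because $a^{(n)}$ and $b^{(n)}$ are bounded, and since $|a^{(n)} - b^{(n)}| = |\sqrt{a^{(n)}} - \sqrt{b^{(n)}}| \cdot |\sqrt{a^{(n)}} + \sqrt{b^{(n)}}|$, we have
that $\sqrt{a^{(n)}} - \sqrt{b^{(n)}} \to 0$ implies $a^{(n)} - b^{(n)} \to 0$. (Without the boundedness assumption this implication may not
hold.) Conversely, if $\sqrt{a^{(n)}} - \sqrt{b^{(n)}} \not\to 0$, then there exists $c > 0$ such that $|\sqrt{a^{(n)}} - \sqrt{b^{(n)}}| > c$ for a subsequence,
in which case $|a^{(n)} - b^{(n)}| = |\sqrt{a^{(n)}} - \sqrt{b^{(n)}}| \cdot |\sqrt{a^{(n)}} + \sqrt{b^{(n)}}| > c \cdot c$, hence $a^{(n)} - b^{(n)} \not\to 0$.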



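There is a special case of Theorem 2 that deserves attention:

Theorem 5. In the setting of Section 3.1, if $\mathrm{Cov}(X) = \mathrm{Cov}(Y)$ and $\mathrm{Cov}(X, Y) = \beta \cdot I_m$ for a real
number $\beta$, then it holds almost surely that

$$\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)}) - \left[ \left( 1 - \frac{|\beta|}{\alpha'} \right) \cdot 2k + \frac{|\beta|}{\alpha'} \cdot d^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)}) \right] \to 0$$

as $n \to \infty$, where $\alpha' := \frac{1}{k} \sum_{j=1}^{k} \sigma_j(\mathrm{Cov}(X))$.

Theorem 5 is an immediate consequence of Theorem 2, since we previously pointed out that when
$\mathrm{Cov}(X, Y)$ is a scalar multiple of the identity then $\mathfrak{d}^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)}) = d^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)})$. □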



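Finally, Theorem 1 from Section 2 is an immediate consequence of Theorem 5, after noting that
the setting of Section 2 is a special case of the setting of Section 3.1, with (recall the definitions
of $\alpha$ and $\gamma$ from Section 2)

$$\mathrm{Cov} \begin{bmatrix} X \\ Y \end{bmatrix} = \begin{bmatrix} \alpha \cdot I_m & \gamma^2 \cdot \alpha \cdot I_m \\ \gamma^2 \cdot \alpha \cdot I_m & \alpha \cdot I_m \end{bmatrix} \in \mathbb{R}^{2m \times 2m}.$$

So $\beta$ and $\alpha'$ of Theorem 5 are, respectively, $\gamma^2 \cdot \alpha$ and $\alpha$, thus in Theorem 5 we have $\frac{|\beta|}{\alpha'} = \gamma^2$.
This proves Theorem 1. □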



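3.3 Bounds for $\mathfrak{d}^2$ and $\rho$

Proposition 6. For $\mathfrak{d}^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)})$ as defined in (3), it holds that $0 \le \mathfrak{d}^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)}) \le 2k$.

Proof of Proposition 6. The upper bound is trivial. To prove the lower bound, first we
re-express (3) as

$$\mathfrak{d}^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)}) = \frac{2k}{\sum_{i=1}^{k} \sigma_i(\mathrm{Cov}(X, Y))} \sum_{i=1}^{k} \left( \sigma_i(\mathrm{Cov}(X, Y)) - \sigma_i\!\left( P_{\mathcal{A}^{(n)}} \mathrm{Cov}(X, Y) P_{\mathcal{B}^{(n)}} \right) \right), \qquad (9)$$

and we show that each summand in the summation of (9) is nonnegative. Indeed, for any
$S \in \mathbb{R}^{m \times m}$ and $i = 1, 2, \ldots, m$, we have that $\sigma_i(S \cdot P_{\mathcal{A}^{(n)}}) \le \sigma_i(S)$ and $\sigma_i(P_{\mathcal{A}^{(n)}} S) \le \sigma_i(S)$; this is
seen as follows. Say $P_{\mathcal{A}^{(n)}} = Q E Q^T$ is such that $Q \in \mathbb{R}^{m \times m}$ is orthogonal and $E$ is diagonal with
1's and 0's on its diagonal. Then $\sigma_i^2(S \cdot P_{\mathcal{A}^{(n)}}) = \lambda_i(P_{\mathcal{A}^{(n)}}^T S^T S P_{\mathcal{A}^{(n)}}) = \lambda_i(Q E Q^T S^T S Q E Q^T) =
\lambda_i(E Q^T S^T S Q E) \le \lambda_i(Q^T S^T S Q) = \lambda_i(S^T S) = \sigma_i^2(S)$, the inequality holding by the Interlacing
Theorem for Hermitian matrices. By a similar argument $\sigma_i(P_{\mathcal{A}^{(n)}} S) \le \sigma_i(S)$, and applying these
in succession yields that $\sigma_i(P_{\mathcal{A}^{(n)}} \mathrm{Cov}(X, Y) P_{\mathcal{B}^{(n)}}) \le \sigma_i(\mathrm{Cov}(X, Y))$. □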
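Proposition 7. For $\rho$, as defined in Theorem 2, it holds that $0 \le \rho \le 1$.

Proof of Proposition 7. The lower bound is immediate, since singular values are nonnegative. Let $\mathrm{Cov}(X, Y) = U^T \Lambda V$ be a singular value decomposition; i.e.
$U, V \in \mathbb{R}^{m \times m}$ are orthogonal and $\Lambda \in \mathbb{R}^{m \times m}$ is diagonal, with nonincreasing nonnegative diagonal
entries. Define $M \in \mathbb{R}^{2m \times 2m}$ as

$$M := \begin{bmatrix} U & 0_m \\ 0_m & V \end{bmatrix} \begin{bmatrix} \mathrm{Cov}(X) & \mathrm{Cov}(X, Y) \\ \mathrm{Cov}^T(X, Y) & \mathrm{Cov}(Y) \end{bmatrix} \begin{bmatrix} U & 0_m \\ 0_m & V \end{bmatrix}^T = \begin{bmatrix} U \mathrm{Cov}(X) U^T & \Lambda \\ \Lambda & V \mathrm{Cov}(Y) V^T \end{bmatrix},$$

where $0_m \in \mathbb{R}^{m \times m}$ is the matrix of zeros. A covariance matrix is positive semidefinite, thus $M$
is positive semidefinite, as well as all of its principal submatrices. For each $j = 1, 2, \ldots, k$, the
two-by-two submatrix consisting of the $j$th and $(j + m)$th rows and columns of $M$ has nonnegative
diagonals and a nonnegative determinant, thus $(U \mathrm{Cov}(X) U^T)_{jj} (V \mathrm{Cov}(Y) V^T)_{jj} \ge (\Lambda_{jj})^2$, i.e.

$$\sigma_j(\mathrm{Cov}(X, Y)) \le \sqrt{(U \mathrm{Cov}(X) U^T)_{jj}} \cdot \sqrt{(V \mathrm{Cov}(Y) V^T)_{jj}}. \qquad (10)$$

Now, summing (10) over $j = 1, 2, \ldots, k$ and applying the Cauchy-Schwarz inequality to the
resulting right-hand side, we obtain

$$\sum_{j=1}^{k} \sigma_j(\mathrm{Cov}(X, Y)) \le \sqrt{\sum_{j=1}^{k} (U \mathrm{Cov}(X) U^T)_{jj}} \cdot \sqrt{\sum_{j=1}^{k} (V \mathrm{Cov}(Y) V^T)_{jj}}. \qquad (11)$$

For any Hermitian matrix, the vector of its diagonals always majorizes the vector of its eigenvalues
(that is, the sum of any $k$ diagonal entries is at most the sum of the $k$ largest eigenvalues), thus

$$\sum_{j=1}^{k} (U \mathrm{Cov}(X) U^T)_{jj} \le \sum_{j=1}^{k} \lambda_j(U \mathrm{Cov}(X) U^T) = \sum_{j=1}^{k} \sigma_j(\mathrm{Cov}(X)), \qquad (12)$$

and Proposition 7 follows from (11), (12), and (12) applied to $\mathrm{Cov}(Y)$ and $V$. □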
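
A quick numerical sanity check of Proposition 7 is sketched below. It is not part of the proof: it draws one random positive semidefinite joint covariance matrix (our own construction, with an arbitrary seed), reads off its blocks, and evaluates the correlation parameter of Theorem 2.

```python
import numpy as np

def top_k_sum(M, k):
    """Sum of the k largest singular values of M."""
    return float(np.linalg.svd(M, compute_uv=False)[:k].sum())

rng = np.random.default_rng(4)
m, k = 5, 2
G = rng.standard_normal((2 * m, 2 * m))
joint = G @ G.T                       # random positive semidefinite 2m-by-2m matrix
cov_x = joint[:m, :m]                 # plays the role of Cov(X)
cov_y = joint[m:, m:]                 # plays the role of Cov(Y)
cov_xy = joint[:m, m:]                # plays the role of Cov(X, Y)

rho = top_k_sum(cov_xy, k) / np.sqrt(top_k_sum(cov_x, k) * top_k_sum(cov_y, k))
print(rho)
```

The printed value should fall in the unit interval, consistent with the proposition; trying other seeds or values of k is a simple way to probe the bound further.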



3.4 An isometry-corrective property of $\mathfrak{d}^2$

Suppose that $W \in \mathbb{R}^{m \times m}$ is an orthogonal matrix such that

$$\mathrm{Cov} \begin{bmatrix} X \\ WY \end{bmatrix} = \begin{bmatrix} \mathrm{Cov}(X) & \beta \cdot I_m \\ \beta \cdot I_m & \mathrm{Cov}(X) \end{bmatrix} \in \mathbb{R}^{2m \times 2m},$$

where $\beta \in \mathbb{R}$ is nonzero; this might arise in situations similar to the cautionary Tale of Two
Scientists in Section 2 (wherein two scientists are taking measurements of the same random process)
except that the second scientist permutes the order of the features (i.e., $W$ is a permutation
matrix). Define $W\mathcal{B}^{(n)} := \{Wx : x \in \mathcal{B}^{(n)}\}$. In this situation, the quantity $d^2(\mathcal{A}^{(n)}, W\mathcal{B}^{(n)})$ may
be more interesting than the quantity $d^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)})$, since $\mathcal{A}^{(n)}$ might be viewed as being more
comparable to $W\mathcal{B}^{(n)}$ than to $\mathcal{B}^{(n)}$. Indeed, if the eigenvalues of $\mathrm{Cov}(X)$ are distinct and $n$ is large
and $W$ is not $I_m$, then $d^2(\mathcal{A}^{(n)}, W\mathcal{B}^{(n)})$ would be small, in contrast to $d^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)})$.

Proposition 8. In the case of the previous paragraph, we have $\mathfrak{d}^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)}) = d^2(\mathcal{A}^{(n)}, W\mathcal{B}^{(n)})$.

Proposition 8 will be illustrated in Section 4.



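Proof of Proposition 8. Here we have

$$\mathrm{Cov} \begin{bmatrix} X \\ Y \end{bmatrix} = \begin{bmatrix} I_m & 0_m \\ 0_m & W^T \end{bmatrix} \begin{bmatrix} \mathrm{Cov}(X) & \beta \cdot I_m \\ \beta \cdot I_m & \mathrm{Cov}(X) \end{bmatrix} \begin{bmatrix} I_m & 0_m \\ 0_m & W \end{bmatrix} = \begin{bmatrix} \mathrm{Cov}(X) & \beta \cdot W \\ \beta \cdot W^T & \mathrm{Cov}(Y) \end{bmatrix},$$

thus for all $i = 1, 2, \ldots, m$

$$\sigma_i(\mathrm{Cov}(X, Y)) = \sigma_i(\beta \cdot W) = |\beta| \qquad (13)$$

and

$$\sigma_i\!\left( P_{\mathcal{A}^{(n)}} \mathrm{Cov}(X, Y) P_{\mathcal{B}^{(n)}} \right) = |\beta| \cdot \sigma_i\!\left( P_{\mathcal{A}^{(n)}} W P_{\mathcal{B}^{(n)}} \right) = |\beta| \cdot \sigma_i\!\left( P_{\mathcal{A}^{(n)}} W P_{\mathcal{B}^{(n)}} W^T \right). \qquad (14)$$

Because $P_{W\mathcal{B}^{(n)}} = W P_{\mathcal{B}^{(n)}} W^T$, and by (1), (3), (13), and (14), it follows that
$\mathfrak{d}^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)}) = \sum_{i=1}^{k} 2 \left( 1 - \sigma_i(P_{\mathcal{A}^{(n)}} W P_{\mathcal{B}^{(n)}} W^T) \right) = d^2(\mathcal{A}^{(n)}, W\mathcal{B}^{(n)})$. □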
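
The identity in Proposition 8 can also be checked numerically. The following NumPy sketch is an illustration under our own assumptions: random k-dimensional subspaces, a random permutation matrix in the role of the orthogonal matrix, and an arbitrary nonzero scalar for the cross-covariance. It evaluates both sides of the proposition.

```python
import numpy as np

rng = np.random.default_rng(3)
m, k, beta = 6, 2, 0.6
QA, _ = np.linalg.qr(rng.standard_normal((m, k)))   # orthonormal basis of a random subspace
QB, _ = np.linalg.qr(rng.standard_normal((m, k)))
PA, PB = QA @ QA.T, QB @ QB.T                       # projection matrices
W = np.eye(m)[rng.permutation(m)]                   # a permutation matrix (orthogonal)

C = beta * W                                        # cross-covariance, as in the proof
top = np.linalg.svd(C, compute_uv=False)[:k].sum()  # equals k * |beta|
s = np.linalg.svd(PA @ C @ PB, compute_uv=False)[:k]
weighted = float(np.sum(2.0 * (1.0 - (k / top) * s)))   # weighted distance between the subspaces

P_WB = W @ PB @ W.T                                 # projection onto the permuted subspace
cosines = np.linalg.svd(PA @ P_WB, compute_uv=False)[:k]
hausdorff = float(np.sum(2.0 * (1.0 - cosines)))    # square Hausdorff distance to the permuted subspace
print(weighted, hausdorff)
```

The two printed values agree up to rounding, so the weighted distance automatically accounts for the permutation.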



4 Simulations 

In this section we use simulations to illustrate and support the theorems which we stated and 
proved in the previous sections, and we then use these simulations to illustrate how the "incom- 
mensurability phenomenon" can arise as a consequence. What is meant by this phenomenon is 
the occurrence of an inordinately large Procrustean fitting-error between projected data that was 
originally highly correlated. (This phenomenon was named in Priebe et al. [1].)



4.1 A first illustration 

Our first illustration of Theorem 2 and Theorem 5 is with $X$ and $Y$ distributed multivariate normal
(with mean vector consisting of all zeros) such that $\mathrm{Cov}(X) = \mathrm{Cov}(Y) = I_6$ and $\mathrm{Cov}(X, Y) = \beta \cdot I_6$
for assorted values of $\beta$. Note that $\rho$ as defined in Theorem 2 is $\beta$ here, note that $\alpha'$ and $\beta$ as defined
in Theorem 5 are, respectively, 1 and $\beta$ here, and note that here $\mathfrak{d}^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)}) = d^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)})$
because $\mathrm{Cov}(X, Y)$ is a scalar multiple of the identity. Also, this example may be seen as an
illustration of Theorem 1 (in the Tale of Two Scientists) with $\gamma^2$ there being $\beta$ here.
The dimension of the space containing $X$ and $Y$ is $m = 6$, and we will project to spaces of
dimension $k = 2$.

For each of $\beta$ = 0, .1, .2, .3, .4, .5, .6, .7, .8, .9, .99, and for each of $n = 1000$ and $n = 10000$, we
obtained 1000 realizations of $\mathcal{X}^{(n)}$ and $\mathcal{Y}^{(n)}$ and used PCA to obtain $\mathcal{A}^{(n)}$ and $\mathcal{B}^{(n)}$. In Figure 1
we plotted the values of $\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})$ against the respective values of $d^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)})$, in colors
blue, green, red, cyan, magenta, blue, green, red, cyan, magenta, blue for the respective values
of $\beta$ = 0, .1, .2, .3, .4, .5, .6, .7, .8, .9, .99. For reference, we also included in Figure 1 lines with
y-intercept $(1 - \beta) \cdot 2k$ and slope $\beta$, for each of the above-specified values of $\beta$; basically, Theorem 1,
Theorem 2, and Theorem 5 state that the scatter plots will adhere to these respective lines in the
limit as $n$ goes to $\infty$. Indeed, notice in Figure 1 that the scatter plots adhere very closely to their
respective lines, and such adherence substantially improves as $n = 1000$ is raised to $n = 10000$,
which supports and illustrates the claims of Theorem 1, Theorem 2, and Theorem 5.

The above was done using PCA to generate $\mathcal{A}^{(n)}$ and $\mathcal{B}^{(n)}$. What if we instead took $\mathcal{A}^{(n)}$ and
$\mathcal{B}^{(n)}$ to (each) be the span of the first two standard-basis vectors in $\mathbb{R}^6$? We will call this the
"trivial" choice of $\mathcal{A}^{(n)}$ and $\mathcal{B}^{(n)}$. Of course, the value of $d^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)})$ would always be identically
zero, and note that Theorem 1, Theorem 2, and Theorem 5 still apply with this choice of $\mathcal{A}^{(n)}$ and
$\mathcal{B}^{(n)}$ because $\mathrm{Cov}(X)$ and $\mathrm{Cov}(Y)$ are scalar multiples of the identity. Thus, the scatter plots from
these above experiments when they are performed instead for the trivial choice of $\mathcal{A}^{(n)}$ and $\mathcal{B}^{(n)}$
would land in the far left of Figure 1 (along the y-axis at $d^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)}) = 0$), clustered about their
respective lines. Indeed, we then performed the above experiments for the trivial choice of $\mathcal{A}^{(n)}$
and $\mathcal{B}^{(n)}$; the sample mean and sample standard deviation of $\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})$ for the 1000 Monte
Carlo replicates when $n = 10000$ were as follows:





             mean, st.dev. of $\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})$    mean, st.dev. of $\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})$
             with PCA                      with trivial $\mathcal{A}^{(n)}$ and $\mathcal{B}^{(n)}$
$\beta$ = 0        3.9546, 0.0170                3.9534, 0.0178
$\beta$ = .1       3.7903, 0.0618                3.6003, 0.0277
$\beta$ = .2       3.5774, 0.1179                3.1994, 0.0276
$\beta$ = .3       3.3413, 0.1796                2.7990, 0.0254
$\beta$ = .4       3.0942, 0.2429                2.4006, 0.0230
$\beta$ = .5       2.7918, 0.3043                1.9999, 0.0210
$\beta$ = .6       2.4581, 0.3658                1.5996, 0.0177
$\beta$ = .7       2.0331, 0.4283                1.2007, 0.0140
$\beta$ = .8       1.5368, 0.4567                0.8003, 0.0103
$\beta$ = .9       0.9232, 0.4607                0.4001, 0.0054
$\beta$ = .99      0.1352, 0.2057                0.0400, 0.0006

Figure 1: Plots of $\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})$ vs $d^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)})$ when $\mathrm{Cov}(X) = \mathrm{Cov}(Y) = I_6$, $\mathrm{Cov}(X, Y) = \beta \cdot I_6$.
For each of $\beta$ = 0 (blue), .1 (green), .2 (red), .3 (cyan), .4 (magenta), .5 (blue), .6 (green), .7 (red),
.8 (cyan), .9 (magenta), .99 (blue), for each of $n = 1000$ (left) and $n = 10000$ (right), there were
1000 Monte Carlo replicates using $k = 2$. Note that the axis-values are to be multiplied by $2k$,
which is 4 here, since the ranges of $\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})$ and $d^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)})$ are the interval $[0, 2k]$.

Indeed, besides the notable exception when $\beta = 0$ (where there is no correlation anyway between $X$
and $Y$), the values of $\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})$ were substantially larger when PCA was used to generate $\mathcal{A}^{(n)}$
and $\mathcal{B}^{(n)}$ than for the trivial choice of $\mathcal{A}^{(n)}$ and $\mathcal{B}^{(n)}$. This is the incommensurability phenomenon,
a situation where use of PCA has the consequence of inordinately large Procrustean fitting-error.

Let us call the values $\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)}) - [(1 - \beta) \cdot 2k + \beta \cdot d^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)})]$ residuals. It is noteworthy
that in the above experiments the sample standard deviation of the residuals when PCA was used
to generate $\mathcal{A}^{(n)}$ and $\mathcal{B}^{(n)}$ is very close to the sample standard deviation of $\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})$ for the
trivial choice of $\mathcal{A}^{(n)}$ and $\mathcal{B}^{(n)}$. Specifically, we computed:





             st.dev. of residuals    st.dev. of $\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})$
             with PCA                with trivial $\mathcal{A}^{(n)}$ and $\mathcal{B}^{(n)}$
$\beta$ = 0        0.0170                  0.0178
$\beta$ = .1       0.0267                  0.0277
$\beta$ = .2       0.0262                  0.0276
$\beta$ = .3       0.0252                  0.0254
$\beta$ = .4       0.0235                  0.0230
$\beta$ = .5       0.0214                  0.0210
$\beta$ = .6       0.0192                  0.0177
$\beta$ = .7       0.0158                  0.0140
$\beta$ = .8       0.0118                  0.0103
$\beta$ = .9       0.0073                  0.0054
$\beta$ = .99      0.0025                  0.0006



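So, it seems empirically here that the variation in $\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})$ not explained by $d^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)})$
when PCA generates $\mathcal{A}^{(n)}$ and $\mathcal{B}^{(n)}$ is approximately the same as the variation in $\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})$
for the trivial choice of $\mathcal{A}^{(n)}$ and $\mathcal{B}^{(n)}$ (in which $d^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)}) = 0$ identically) and, as such,
$d^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)})$ explains all of the rest of the variation here in $\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})$ when PCA is used.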



4.2 A second illustration 

Our next illustration of Theorem 2 and Theorem 5 is with $X$ and $Y$ multivariate normal (with
mean vector of all zeros) such that $\mathrm{Cov}(X) = \mathrm{Cov}(Y)$ = the diagonal matrix in $\mathbb{R}^{20 \times 20}$ with .7 on all
diagonals except for the first diagonal, which has the value 1, and such that $\mathrm{Cov}(X, Y) = .6 \cdot I_{20}$.
So we are using $m = 20$ here. As above, $\mathfrak{d}^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)}) = d^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)})$ because $\mathrm{Cov}(X, Y)$ is a
scalar multiple of the identity.

We will use three different projection dimensions, $k = 1, 2, 10$. When $k = 1$ the formula
in Theorem 2 yields $\rho = \frac{.6}{1} = .6$, when $k = 2$ the formula yields $\rho = \frac{1.2}{1.7} \approx .7059$, and when
$k = 10$ the formula yields $\rho = \frac{6}{7.3} \approx .8219$.
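
The three values of the correlation parameter quoted above can be reproduced directly from the formula in Theorem 2. The sketch below is a small check with NumPy; the function name `rho` and the variable names are ours.

```python
import numpy as np

def rho(cov_x, cov_y, cov_xy, k):
    """Correlation parameter of Theorem 2 from top-k singular-value sums."""
    sx = np.linalg.svd(cov_x, compute_uv=False)[:k].sum()
    sy = np.linalg.svd(cov_y, compute_uv=False)[:k].sum()
    sxy = np.linalg.svd(cov_xy, compute_uv=False)[:k].sum()
    return float(sxy / (np.sqrt(sx) * np.sqrt(sy)))

m = 20
cov = np.diag([1.0] + [0.7] * (m - 1))   # Cov(X) = Cov(Y) = diag(1, .7, ..., .7)
cross = 0.6 * np.eye(m)                  # Cov(X, Y) = .6 * I_20
values = {k: round(rho(cov, cov, cross, k), 4) for k in (1, 2, 10)}
print(values)
```

The parameter grows with k here because each additional projected dimension adds .6 to the numerator but only .7 to each sum under the square roots.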

Using PCA to generate $\mathcal{A}^{(n)}$ and $\mathcal{B}^{(n)}$, we obtained 10000 realizations of $\mathcal{X}^{(n)}$ and $\mathcal{Y}^{(n)}$ when
$n = 10000$, for each projection dimension $k = 1$, $k = 2$, and $k = 10$; the values of $\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})$
are plotted against the respective values of $d^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)})$ in the left figure of Figure 2, with $k = 1$
in blue, $k = 2$ in red, and $k = 10$ in green. As before, lines are drawn on the figure to indicate
the limiting relationship between $\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})$ and $d^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)})$ that is predicted by Theorem 2
and Theorem 5; indeed, the scatter plots adhere very closely to these respective lines. The right-hand
side of Figure 2 shows 2000 Monte Carlo simulations when $k = 2$ for each of n = 10^ (yellow),
n = 10^ (blue), n = 10^ (magenta) and n = 10^ (black). As $n$ gets larger, these are seen
to get increasingly closer to the corresponding limiting relationship between $\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})$ and
$d^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)})$. All of this supports the claims of Theorem 2 and Theorem 5.

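In the experiments for the left figure in Figure 2, the sample mean and sample standard
deviation of $\frac{\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})}{2k}$ were as follows:

             sample mean of $\frac{\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})}{2k}$    sample standard deviation of $\frac{\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})}{2k}$
$k = 1$       .4017                          .0069
$k = 2$       .4323                          .0950
$k = 10$      .2797                          .0244
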

(We normalize $\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})$ with division by 2k since the range of
$\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})$ is [0, 2k].) As k increases, the correlation $\rho$
increases, so it would seem at first thought that the normalized Procrustean fitting-error
$\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})/(2k)$ should decrease. Indeed, the leftmost green points
in (the left figure of) Figure 2 are below the leftmost red points, which are below the leftmost blue
points. However, overall, the normalized Procrustean fitting-error is seen in the table above to be
much higher in the case of k = 2 than in the case of k = 1. This is explained by noting a substantial
gap between the first and second eigenvalues of Cov(X) (1 vs .7), whereas there is no gap between the
second and third eigenvalues of Cov(X) (both are .7). Thus when k = 1 the PCA projection has little
variance whereas when k = 2 the PCA projection has much variance, often causing a much larger Hausdorff
distance between $\mathcal{A}^{(n)}$ and $\mathcal{B}^{(n)}$, which results in larger Procrustean
fitting-error by Theorem 2. As such, the case of k = 2 is an example of the incommensurability
phenomenon of inordinately large Procrustean fitting-error. But then observe that when k = 10 the
normalized Procrustean fitting-error is competitive with the k = 1 case; even though the tenth and
eleventh eigenvalues of Cov(X) are the same, the correlation $\rho$ has increased, and the variance of
the PCA projection has decreased, by enough to offset the missing spectral gap.
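The spectral-gap explanation above can be probed numerically. The following is a rough sketch under assumptions: the subspace distance is taken to be 2k minus twice the sum of the principal-angle cosines, divided by 2k (a stand-in for the normalized Hausdorff distance, whose exact definition is not restated here), and the names `top_subspace` and `normalized_subspace_distance` are hypothetical. It compares how far apart the two PCA subspaces land for k = 1 (gap of .3 at the projection dimension) and k = 2 (no gap).

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, reps = 10000, 20, 20
var = np.full(d, 0.7)
var[0] = 1.0                      # eigenvalues 1, .7, ..., .7
noise_sd = np.sqrt(var - 0.6)     # Cov(X, Y) = .6 * I via a shared signal

def top_subspace(data, k):
    """Orthonormal basis of the top-k sample principal subspace."""
    _, vecs = np.linalg.eigh(np.cov(data, rowvar=False))
    return vecs[:, -k:]

def normalized_subspace_distance(a, b):
    """(2k - 2 * sum of principal-angle cosines) / (2k); assumed distance."""
    k = a.shape[1]
    cosines = np.linalg.svd(a.T @ b, compute_uv=False)
    return (2 * k - 2 * cosines.sum()) / (2 * k)

dist = {1: [], 2: []}
for _ in range(reps):
    shared = rng.normal(scale=np.sqrt(0.6), size=(n, d))
    x = shared + rng.normal(size=(n, d)) * noise_sd
    y = shared + rng.normal(size=(n, d)) * noise_sd
    for k in dist:
        dist[k].append(
            normalized_subspace_distance(top_subspace(x, k), top_subspace(y, k))
        )

mean_dist = {k: float(np.mean(v)) for k, v in dist.items()}
print(mean_dist)  # the k = 2 entry should be the larger of the two
```

With no gap between the second and third eigenvalues, the second principal direction is essentially arbitrary within a 19-dimensional eigenspace, so the two samples tend to pick different directions and the k = 2 distance comes out larger.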
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Figure 2: Plots of $\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})$ vs $d^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)})$ for Cov(X) = Cov(Y) = diag(1, .7, .7, ..., .7) $\in \mathbb{R}^{20 \times 20}$,
Cov(X, Y) = .6 $\cdot I_{20}$. The figure on the left shows 10000 Monte Carlo replications when
n = 10000, for each of k = 1 (blue), k = 2 (red), and k = 10 (green). The figure on the right shows
2000 Monte Carlo replications when k = 2, for each of n = 10^ (yellow), n = 10^ (cyan), n = 10^
(magenta), and n = 10^ (black). Note that the axis-values are to be multiplied by 2k for the
respective values of k, since the ranges of $\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})$ and $d^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)})$ are the interval [0, 2k].
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Not only may the incommensurability phenomenon occur when there is no spectral gap in the
covariance structure at the projection dimension, but it may also occur when this spectral gap is
positive but small. Indeed, we repeated the experiments performed for the left figure in Figure 2,
changing only the second diagonal entry of Cov(X) = Cov(Y) from .7 to $\lambda$, for each of
$\lambda$ = .71, .72, .73, .74, .75. The resulting scatter plots looked very similar to the left
figure in Figure 2, and the sample mean and sample standard deviation of
$\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})/(2k)$ were as follows:



Sample mean and sample standard deviation (stdev) of $\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})/(2k)$:

                             $\lambda = .71$   $\lambda = .72$   $\lambda = .73$   $\lambda = .74$   $\lambda = .75$
   k = 1     sample mean
             sample stdev    .0068             .0070             .0069             .0070             .0070
   k = 2     sample mean
             sample stdev    .0960             .0953             .0922             .0832             .0662
   k = 10    sample mean     .2807             .2809             .2810             .2815             .2815
             sample stdev    .0244             .0243             .0245             .0244             .0242



In the case of k = 2, the spectral gap in the covariance structure at the projection dimension is
$\lambda - .7$, and note that as this gap grows to .75 - .7 = .05 there is a lessening of the
incommensurability phenomenon, but the phenomenon is still very much present. Indeed (from the table
above), when $\lambda = .75$, the sample mean of $\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})/(2k)$
when k = 2 is below the sample mean when k = 1, but it is lower by less than half of the sample
standard deviation of $\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})/(2k)$ when k = 2 and, in fact,
the sample standard deviation of $\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})/(2k)$ when k = 2
is more than 9 times the sample standard deviation when k = 1. Thus there is a significant
probability of an inordinately high Procrustean fitting error in the case of k = 2 with $\lambda = .75$.
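As a quick arithmetic check of the comparison above, using only the $\lambda = .75$ sample standard deviations reported in the table (.0662 for k = 2 and .0070 for k = 1):

```latex
\[
\frac{.0662}{.0070} \approx 9.46 > 9,
\qquad
\tfrac{1}{2}\,(.0662) = .0331 .
\]
```

So the statement that the k = 2 sample mean is lower "by less than half" of the k = 2 sample standard deviation means that the two sample means differ by less than .0331.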



4.3 A modification of the second illustration to illustrate the isometry-corrective property of $\delta^2$

Our next illustration of Theorem 2 is with X and Y distributed multivariate normal, with joint
covariance matrix given by:
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Of course, this is exactly the illustration in the beginning of Section 4.2, with the only exception
that the coordinates of Y have been permuted into reverse order. Performing the very same
experiments from the beginning of Section 4.2, the scatter plots of
$\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})$ vs $d^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)})$
will not look like the scatter plots in Figure 2. However, since the permutation transformation is
an isometry, we then have by Proposition 8 in Section 3.4 that the scatter plots of
$\epsilon^2(\mathcal{X}^{(n)}, \mathcal{Y}^{(n)})$ vs $\delta^2(\mathcal{A}^{(n)}, \mathcal{B}^{(n)})$
will indeed look like the scatter plots in Figure 2. The use of $\delta^2$ automatically
accounts for isometrical transformations of X and/or Y from a common frame, in the manner of
this example.
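A small population-level sketch may help show why reversing the coordinates of Y matters for $d^2$ and why undoing an isometry repairs it. The assumptions are labeled in the code: the subspace distance is taken to be 2k minus twice the sum of the principal-angle cosines, the isometry is undone by hand because it is known here, and the helper names are hypothetical. How $\delta^2$ itself is defined is not restated in this section, so this is not an implementation of it.

```python
import numpy as np

d, k = 20, 1
var = np.full(d, 0.7)
var[0] = 1.0
cov_x = np.diag(var)                    # Cov(X) = diag(1, .7, ..., .7)

reverse = np.eye(d)[::-1]               # permutation reversing coordinate order
cov_y = reverse @ cov_x @ reverse.T     # covariance of the reversed Y

def top_subspace(cov, k):
    """Orthonormal basis of the top-k principal subspace of a covariance."""
    _, vecs = np.linalg.eigh(cov)
    return vecs[:, -k:]                 # eigh sorts eigenvalues ascending

def normalized_subspace_distance(a, b):
    """(2k - 2 * sum of principal-angle cosines) / (2k); assumed distance."""
    k = a.shape[1]
    cosines = np.linalg.svd(a.T @ b, compute_uv=False)
    return (2 * k - 2 * cosines.sum()) / (2 * k)

a = top_subspace(cov_x, k)              # spans the first coordinate axis
b = top_subspace(cov_y, k)              # spans the last coordinate axis

is_isometry = np.allclose(reverse.T @ reverse, np.eye(d))
raw = normalized_subspace_distance(a, b)
corrected = normalized_subspace_distance(a, reverse.T @ b)
print(is_isometry, raw, corrected)
```

For k = 1 the two principal directions are orthogonal coordinate axes, so the uncorrected distance is at its maximum, while mapping the second subspace back through the inverse permutation makes the two subspaces coincide.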

It should also be mentioned that, for the illustration of this section (with the covariance matrix
above), if $\mathcal{A}^{(n)}$ and $\mathcal{B}^{(n)}$ were not generated with PCA, but instead
$\mathcal{A}^{(n)}$ and $\mathcal{B}^{(n)}$ were selected to be (the same as each other by setting them
to be) the span of any number of standard-basis vectors in $\mathbb{R}^{20}$, then the Procrustes
fitting-error would be disastrously large. The fact that such a naive choice of $\mathcal{A}^{(n)}$ and
$\mathcal{B}^{(n)}$ was successful in the illustration in Section 4.1 was just a byproduct
of the good fortune that X and Y did not have permuted coordinates or any other isometrical
transformation applied to them.
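The effect of the naive standard-basis choice can be illustrated with a short simulation. This is a sketch under assumptions rather than a reproduction of the paper's experiment: the data follow the covariance of Section 4.2 with the coordinates of Y reversed, both data sets are projected onto the first k = 2 standard-basis vectors, the projected data are normalized to squared Frobenius norm k, and `normalized_fit_error` is a hypothetical helper returning the Procrustean fitting error divided by 2k.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 5000, 20, 2
var = np.full(d, 0.7)
var[0] = 1.0
noise_sd = np.sqrt(var - 0.6)

# Shared signal plus independent noise gives Cov(X, Y) = .6 * I.
shared = rng.normal(scale=np.sqrt(0.6), size=(n, d))
x = shared + rng.normal(size=(n, d)) * noise_sd
y = shared + rng.normal(size=(n, d)) * noise_sd
y_rev = y[:, ::-1]                      # coordinates of Y in reverse order

def normalized_fit_error(p, q):
    """Procrustean fitting error of the normalized data, divided by 2k."""
    k = p.shape[1]
    p = p * np.sqrt(k) / np.linalg.norm(p)
    q = q * np.sqrt(k) / np.linalg.norm(q)
    singular_values = np.linalg.svd(q.T @ p, compute_uv=False)
    return (2 * k - 2 * singular_values.sum()) / (2 * k)

# Naive projection: keep the first k coordinates of each data set.
err_aligned = normalized_fit_error(x[:, :k], y[:, :k])
err_reversed = normalized_fit_error(x[:, :k], y_rev[:, :k])
print(err_aligned, err_reversed)
```

When the coordinates are aligned, the first k coordinates of the two data sets measure the same underlying quantities and the fitting error is moderate. After reversal they measure unrelated coordinates, so the normalized fitting error sits near its maximum value of 1.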
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